Class 14: Interactive Data visualizations¶

Plan for today:

  • Quick review of seaborn
  • Discuss interactive graphics using plotly
  • If there is time: Discuss creating maps
In [10]:
import YData

# YData.download.download_class_code(14)   # get class code    
# YData.download.download_class_code(14, True)  # get the code with the answers 

If you are using Colab, you should install the YData package by uncommenting and running the code below, which also mounts your Google Drive.

In [11]:
# !pip install https://github.com/emeyers/YData_package/tarball/master
# from google.colab import drive
# drive.mount('/content/drive')
In [12]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

Review of seaborn!¶

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

I.e., it is built on top of matplotlib but produces better looking plots that are easier to create.

Let's start by examining different themes which can produce better looking plots. We can do this using the sns.set_theme() function.

In [13]:
# Import seaborn
import seaborn as sns

# Apply the default theme
sns.set_theme()   # default style is 'darkgrid'
#sns.set_theme(style='whitegrid')

# Side note: Matplotlib also has themes
# plt.style.available
# plt.style.use('fivethirtyeight')

Penguins!¶

Let's get a little more practice with seaborn by continuing to explore the penguins data set.

In [14]:
# Let's look at some penguins
penguins = sns.load_dataset("penguins")

penguins.head()
Out[14]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female

Plotting a single quantitative variable using sns.displot()¶

We can plot a single quantitative variable using the sns.displot() function.

Properties we can set include

  • x: The name of the data column you want to plot
  • hue: The name of the column that colors each point
  • kind: The type of plot

Different options for kind are: "hist", "kde", "ecdf"
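To make the "ecdf" option more concrete: an empirical cumulative distribution function gives, for each observed value, the proportion of observations less than or equal to it. The sketch below computes this by hand with NumPy for the first five flipper lengths shown in the penguins table above. The helper name ecdf_points is hypothetical (it is not part of seaborn), and the sketch only illustrates the idea rather than how seaborn computes it internally.

```python
import numpy as np

def ecdf_points(values):
    """Hypothetical helper: sorted values and the proportion of observations <= each value."""
    x = np.asarray(values, dtype=float)
    x = np.sort(x[~np.isnan(x)])            # drop missing values, then sort
    y = np.arange(1, len(x) + 1) / len(x)   # cumulative proportions 1/n, 2/n, ..., 1
    return x, y

# First five flipper lengths from penguins.head() (one is missing)
flipper_lengths = [181.0, 186.0, 195.0, np.nan, 193.0]

x, y = ecdf_points(flipper_lengths)
print(x)
print(y)
```

Plotting y against x as a step function gives the same kind of curve that kind="ecdf" draws for each species.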

Warm-up exercise 1¶

Please use sns.displot() to create a visualization of flipper length, where each species is in a different color (i.e., a different hue). Also, experiment with the "kind" of visualization and choose the kind you think creates the best visualization.

In [15]:
# plot the flipper length
sns.displot(data = penguins, 
            x="flipper_length_mm", 
            hue="species", 
            kind="hist");  # Experiment with "hist", "kde" and "ecdf"

Pairs plots¶

One of the most useful visualizations for exploring the relationships between several quantitative variables is to create a "pairs plot" which creates a series of scatter plots between all quantitative variables in the data. We can do this in seaborn using the sns.pairplot(data) function!
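As a rough guide to which variables will appear in a pairs plot, we can list the quantitative columns of a DataFrame using the pandas select_dtypes() method. The small DataFrame below is a hand-typed copy of a few values from the penguins table above (the name penguins_small is made up for this sketch), so that the example runs without downloading anything.

```python
import pandas as pd

# Hand-typed subset of the penguins data shown earlier
penguins_small = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie"],
    "bill_length_mm": [39.1, 39.5, 40.3],
    "bill_depth_mm": [18.7, 17.4, 18.0],
    "sex": ["Male", "Female", "Female"],
})

# Keep only the numeric (quantitative) columns
numeric_cols = penguins_small.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

Categorical columns such as species do not get their own panels, but they can still be used to color the points.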

Warm-up exercise 2¶

Use the pairplot() function to visualize the relationships between all columns in the penguins DataFrame. Also, make each species have a different color.

In [16]:
# Create pair plots for the different variables in the penguins data set

sns.pairplot(penguins, hue = "species");

Interactive data visualizations with plotly¶

Let's now look at interactive visualizations using the plotly express package.

Interactive visualizations can't be used in static reports (such as the PDF used for your class project), but they are useful for exploring data to understand key trends, and these types of graphics can be embedded in webpages.

Let's start with our favorite data set to visualize, the gapminder data! The gapminder data comes with the plotly package and can be loaded using the code below.

In [17]:
import plotly.express as px

# Load plotly.js into the notebook so that figures render inline
import plotly
plotly.offline.init_notebook_mode()


gapminder = px.data.gapminder()   # the plotly package comes with the gapminder data

print(type(gapminder))

gapminder.head(3)
<class 'pandas.core.frame.DataFrame'>
Out[17]:
country continent year lifeExp pop gdpPercap iso_alpha iso_num
0 Afghanistan Asia 1952 28.801 8425333 779.445314 AFG 4
1 Afghanistan Asia 1957 30.332 9240934 820.853030 AFG 4
2 Afghanistan Asia 1962 31.997 10267083 853.100710 AFG 4

Let's now get the gapminder data from 2007. As you know, we can do this using Boolean masking. We can also do this using the .query() method!

In [18]:
# Get the gapminder data from only 2007

gapminder_2007 = gapminder[gapminder['year'] == 2007]

gapminder_2007_alt = gapminder.query("year==2007")

gapminder_2007.equals(gapminder_2007_alt)
Out[18]:
True
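One feature of .query() worth knowing is that a Python variable can be referenced inside the query string by prefixing it with @. The sketch below shows this on a tiny DataFrame typed in from the first three gapminder rows shown above; the variable name target_year is made up for this example.

```python
import pandas as pd

# First three gapminder rows (subset of columns), typed in by hand
df = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Afghanistan"],
    "year": [1952, 1957, 1962],
    "lifeExp": [28.801, 30.332, 31.997],
})

target_year = 1957  # hypothetical variable holding the year we want

masked = df[df["year"] == target_year]          # Boolean masking
queried = df.query("year == @target_year")      # same filter using .query()

same = masked.equals(queried)
print(same)
```

This can be convenient when the value to filter on is computed elsewhere in the notebook rather than typed directly into the query string.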

Line plots¶

Let's create a line plot showing life expectancy as a function of the year using the px.line() function. In particular, let's set the following properties of the plot:

  • x: Year
  • y: Life expectancy
  • color: The continent
  • line_group: The country
  • hover_name: The country
  • line_shape: spline
  • render_mode: svg to use svg graphics

What do you think of this plot?

In [19]:
# Create an interactive line plot

fig = px.line(gapminder, x="year", y="lifeExp", 
              color="continent", 
              line_group="country", 
              hover_name="country",
              line_shape="spline", 
              render_mode="svg")

fig.show()

Scatter plots¶

Let's now use plotly to recreate our scatter plot of country life expectancy as a function of GDP per capita from the gapminder_2007 data. In particular, we can use the px.scatter(data_frame = , x = , y = , ...) function, which works similarly to seaborn's sns.relplot() function.

Let's try out the px.scatter(data_frame = , x = , y = , ...) function using the following mappings:

  • x: GDP per capita
  • y: Life Expectancy
  • size: The country population
  • color: Continent

We can also set the following properties:

  • hover_name: The name of the country
  • log_x: Set it to True to make the x-axis on a log10 scale
  • size_max: Set it to 60 to make the scaling for the population display better

Finally, if we want to have separate facets for columns we can use facet_col.

In [20]:
# Create a scatter plot in plotly

fig = px.scatter(data_frame = gapminder_2007, 
                 x="gdpPercap", 
                 y="lifeExp", 
                 size="pop", 
                 color="continent",
                 hover_name="country", 
                 log_x=True, 
                 size_max=60)


# Add axis labels
fig.update_layout(xaxis_title="GDP per capita ($)",
                 yaxis_title="Life Expectancy")


fig.show()

Animations¶

We can also add animations to our plots using the following arguments:

  • animation_frame: defines which variable to animate over; i.e., each frame in the animation will be one value of this variable.

  • animation_group: Values from this column or array_like are used to provide object-constancy across animation frames: rows with matching animation_groups will be treated as if they describe the same object in each frame. This allows the animation to smoothly interpolate between frames.

We can also set the x and y ranges of our plots to match the ranges of data over the full animation sequence.

  • range_x: The range that the x-values should take
  • range_y: The range that the y-values should take
In [21]:
# Create an animated scatter plot

fig = px.scatter(gapminder, 
                 x="gdpPercap", 
                 y="lifeExp", 
                 animation_frame="year", 
                 animation_group="country",
                 size="pop", 
                 color="continent", 
                 hover_name="country", 
                 facet_col="continent",
                 log_x=True, 
                 size_max=45, 
                 range_x=[100,100000], 
                 range_y=[25,90])


fig.show()
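The range_x and range_y values in the cell above were typed in by hand. One possible way to derive such ranges from the data is sketched below: round the x limits outward to powers of ten (since the x-axis is on a log scale) and the y limits outward to multiples of 5. The numbers in this DataFrame are illustrative values rather than the full gapminder data, and the rounding rules are just one reasonable choice.

```python
import math
import pandas as pd

# Illustrative values spanning the first and last years of an animation
df = pd.DataFrame({
    "year": [1952, 1952, 2007, 2007],
    "gdpPercap": [779.4, 9867.1, 974.6, 42951.7],
    "lifeExp": [28.8, 69.1, 43.8, 81.2],
})

# x-axis is logarithmic: round outward to the nearest powers of ten
range_x = [10 ** math.floor(math.log10(df["gdpPercap"].min())),
           10 ** math.ceil(math.log10(df["gdpPercap"].max()))]

# y-axis is linear: round outward to the nearest multiples of 5
range_y = [5 * math.floor(df["lifeExp"].min() / 5),
           5 * math.ceil(df["lifeExp"].max() / 5)]

print(range_x)
print(range_y)
```

Because the ranges are computed over all years at once, the axes stay fixed while the animation plays, which makes frames easier to compare.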

Additional visualizations¶

There are a number of other visualizations we can create using plotly. Let's briefly explore sunburst plots, treemaps and heatmaps.

Please see the plotly express documentation to learn more about other plots you can create: https://plotly.com/python/plotly-express/

Sunburst plots¶

Sunburst is a generalization of a pie chart for data that has a hierarchical structure; i.e., it can plot categorical data that has a hierarchical structure.

Let's create a sunburst plot showing how much of the world's population is in each continent at the inner level, and then each country within each continent at the outer level. In particular, let's set the following properties:

  • path: Should be a list with continent at the inner level and country at the outer level.
  • values: Should specify that the angle of each segment is given by the country's population
  • color: Set to the countries' life expectancies

What do you think of this plot?

In [22]:
# Create a sunburst plot

fig = px.sunburst(gapminder_2007, 
                  path=['continent', 'country'], 
                  values='pop', 
                  color='lifeExp')

fig.update_layout(width = 500, height = 500)
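To see what the inner ring of a sunburst plot like the one above represents, it can help to compute the continent-level numbers by hand. The sketch below uses made-up countries and values. It sums the population within each continent (which determines the size of each inner segment) and computes a population-weighted average life expectancy, which, as far as I understand from the plotly documentation, is how the color of a parent segment is derived when color is numeric.

```python
import pandas as pd

# Made-up illustrative data (not real gapminder values)
df = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe"],
    "country": ["X", "Y", "Z"],
    "pop": [300, 100, 100],
    "lifeExp": [70.0, 60.0, 80.0],
})

# Total population per continent and each continent's share of the whole
continent_pop = df.groupby("continent")["pop"].sum()
continent_share = continent_pop / continent_pop.sum()

# Population-weighted average life expectancy per continent
weighted_lifeExp = (df["lifeExp"] * df["pop"]).groupby(df["continent"]).sum() / continent_pop

print(continent_share)
print(weighted_lifeExp)
```

The same aggregation applies to treemaps, where the totals set the area of each continent's rectangle rather than the angle of a segment.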

Treemap¶

Treemaps allow one to view hierarchical relationships by creating a sequence of nested rectangles. We can use plotly's px.treemap() function to create interactive tree maps.

Let's create an interactive treemap showing the population of each country separately for each continent, as well as color each country based on the average life expectancy. In particular, let's set the following properties:

  • path: Should be a list with continent at the highest level and country nested within continent. We can also set the first argument of the list to be px.Constant('world') so that at the highest level we get the label "world".
  • values: Should specify that the size of each rectangle is equal to a country's population
  • color: Set to the countries' life expectancies

What do you think of this plot?

In [23]:
# Create a treemap

fig = px.treemap(gapminder_2007, 
                 path=[px.Constant('world'), 'continent', 'country'], 
                 values='pop', 
                 color='lifeExp')
                 #color='gdpPercap')

fig.show()

Pivot tables and heatmaps¶

Heatmaps allow us to view data that is a function of two variables.

In order to create a heatmap, we first need to transform our data into a DataFrame that has the appropriate rows and columns. One way we can do this is to use the pandas .pivot_table(index = , columns = , values = , aggfunc = ) method, where the arguments to this method are:

  • index: The variable we want in the rows of our DataFrame
  • columns: The variable we want in the columns of our DataFrame
  • values: The values we want to be in the DataFrame
  • aggfunc: The function we will use to aggregate our data
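One way to understand what .pivot_table() does is to compare it with a groupby followed by an unstack, which should give the same table. The sketch below does this on a tiny DataFrame with made-up values, where some continent and year combinations have more than one row so that the aggregation step matters.

```python
import numpy as np
import pandas as pd

# Made-up illustrative data with repeated (continent, year) combinations
df = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Europe", "Europe", "Europe"],
    "year": [1952, 1952, 2007, 1952, 2007, 2007],
    "lifeExp": [40.0, 50.0, 70.0, 65.0, 78.0, 80.0],
})

# Pivot table: continents in the rows, years in the columns, mean lifeExp in the cells
wide = df.pivot_table(index="continent", columns="year",
                      values="lifeExp", aggfunc="mean")

# Same result: group by both variables, average, then move year into the columns
wide_alt = df.groupby(["continent", "year"])["lifeExp"].mean().unstack("year")

print(wide)
print(np.allclose(wide.values, wide_alt.values))
```

Here the Asia value for 1952 is the mean of 40 and 50, which shows that aggfunc is applied to all rows that fall into the same cell.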

Let's apply the .pivot_table() method to our gapminder data to create a DataFrame called gapminder_continent_wide where:

  • The rows are the different continents
  • The columns are the year
  • The values in the DataFrame are the average life expectancy (For each continent in each year)
In [24]:
# Generate a pivot table from the gapminder data

gapminder_continent_wide = gapminder.pivot_table(index = 'continent', 
                                                 columns = 'year', 
                                                 values = 'lifeExp', 
                                                 aggfunc = 'mean')
gapminder_continent_wide.head()
Out[24]:
year 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
continent
Africa 39.135500 41.266346 43.319442 45.334538 47.450942 49.580423 51.592865 53.344788 53.629577 53.598269 53.325231 54.806038
Americas 53.279840 55.960280 58.398760 60.410920 62.394920 64.391560 66.228840 68.090720 69.568360 71.150480 72.422040 73.608120
Asia 46.314394 49.318544 51.563223 54.663640 57.319269 59.610556 62.617939 64.851182 66.537212 68.020515 69.233879 70.728485
Europe 64.408500 66.703067 68.539233 69.737600 70.775033 71.937767 72.806400 73.642167 74.440100 75.505167 76.700600 77.648600
Oceania 69.255000 70.295000 71.085000 71.310000 71.910000 72.855000 74.290000 75.320000 76.945000 78.190000 79.740000 80.719500

Now that we have the appropriate DataFrame, let's use the plotly imshow() function to visualize it!

In [25]:
# use plotly imshow() to visualize the pivot table

fig = px.imshow(gapminder_continent_wide)

fig.update_layout(xaxis_title = "Year", yaxis_title = "")
In [26]:
# We can create heatmaps in seaborn as well

g = sns.heatmap(gapminder_continent_wide, 
                annot=True, 
                fmt=".0f");

g.set_xlabel("");
g.set_ylabel("");